SBC validation comparing parameter identification between m_0 and m_1, providing further evidence that the addition of risky choices improves identification of δ.
In Report 4, we observed poor recovery of the utility increment parameters δ in model m_0, and in Report 5, we hypothesized that adding risky choices would improve identification. The parameter recovery experiments in Report 5 suggested improvement, but with only 20 iterations, the results may not be statistically robust.
Simulation-Based Calibration (SBC) provides a more principled approach to assess parameter identification. Rather than asking “can we recover specific true values?” (as in parameter recovery), SBC asks: “does the posterior correctly represent uncertainty about the parameters?” That is, if we repeatedly (i) draw parameters from the prior, (ii) simulate data given those parameters, and (iii) compute the posterior, then the posterior should, on average, be calibrated—assigning the correct probability to regions of parameter space. SBC operationalizes this idea by checking whether the true parameter value occupies a uniform rank position within the posterior samples; systematic departures from uniformity signal that the posterior is too wide, too narrow, or shifted relative to the data-generating process.
Note: The SBC Principle
If the inference algorithm is correct and the model is identified, then:
Draw θ* from the prior: \(\theta^* \sim p(\theta)\)
Simulate data: \(y \sim p(y | \theta^*)\)
Compute posterior: \(p(\theta | y)\)
Calculate the rank of θ* in the posterior samples
The distribution of thinned ranks should be uniform if the model is well-calibrated. (Thinning is necessary because HMC produces dependent posterior samples; by retaining only every \(t\)-th draw, we reduce autocorrelation and better approximate the independence assumption underlying the rank statistic.) Non-uniformity indicates problems with either the inference algorithm or parameter identification.
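The effect of thinning can be made concrete with a small sketch. The AR(1) chain below is an illustrative stand-in for correlated HMC output (not our actual sampler); keeping every 4th draw reduces the lag-1 autocorrelation from roughly \(\phi\) to roughly \(\phi^4\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a dependent chain as an AR(1) process (an illustrative
# assumption standing in for HMC output, not our actual sampler).
phi = 0.8                      # lag-1 autocorrelation of the raw chain
n = 4000
chain = np.empty(n)
chain[0] = rng.normal()
for t in range(1, n):
    chain[t] = phi * chain[t - 1] + np.sqrt(1 - phi**2) * rng.normal()

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation."""
    x = x - x.mean()
    return np.dot(x[:-1], x[1:]) / np.dot(x, x)

thinned = chain[::4]           # keep every 4th draw, as in our SBC setup

print(f"raw lag-1 autocorr:     {lag1_autocorr(chain):.3f}")    # near phi = 0.8
print(f"thinned lag-1 autocorr: {lag1_autocorr(thinned):.3f}")  # near phi^4 ~ 0.41
```

A thinning factor of 4 does not make the draws independent, but it brings the ranks much closer to the independence assumption behind the uniform rank statistic.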
Following the Stan User’s Guide (§10.1), we set the number of post-thinned posterior draws \(L\) and the number of SBC iterations \(N\) to be equal, both 999, so that the rank statistic takes one of \(L + 1 = 1000\) possible values. We use a single MCMC chain to avoid any between-chain alignment issues and thin by a factor of 4 to reduce autocorrelation.
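The full SBC loop can be sketched on a toy conjugate model where the posterior is available in closed form and sampled exactly, so the ranks are uniform by construction. The normal-normal model below is an illustrative assumption, not m_0 or m_1:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Toy conjugate model (illustrative stand-in for m_0/m_1):
#   prior:      mu ~ Normal(0, 1)
#   likelihood: y_i ~ Normal(mu, sigma), i = 1..n_obs
# The posterior is Normal and sampled exactly, so SBC ranks must be uniform.
sigma, n_obs = 2.0, 10
L = 999                                  # posterior draws per simulation
N = 999                                  # SBC iterations

ranks = np.empty(N, dtype=int)
for i in range(N):
    mu_star = rng.normal(0.0, 1.0)                      # (i) draw from prior
    y = rng.normal(mu_star, sigma, size=n_obs)          # (ii) simulate data
    prec = 1.0 + n_obs / sigma**2                       # posterior precision
    post_mean = (y.sum() / sigma**2) / prec
    draws = rng.normal(post_mean, 1.0 / np.sqrt(prec), size=L)  # (iii) posterior
    ranks[i] = np.sum(draws < mu_star)                  # (iv) rank in {0,...,L}

# Calibrated model: ranks should be uniform on {0, ..., L}.
counts, _ = np.histogram(ranks, bins=20, range=(0, L + 1))
chi2, p = stats.chisquare(counts)
print(f"chi-square p-value: {p:.3f}")
```

For m_0 and m_1 the posterior comes from MCMC rather than a closed form, which is why the thinning and diagnostics below are needed, but the loop structure is the same.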
0.2 SBC Methodology
0.2.1 Rank Statistics
For each parameter θ, we compute the rank of the true value θ* within the posterior samples {θ₁, θ₂, …, θ_S}: \( r(\theta^*) = \sum_{s=1}^{S} \mathbb{1}[\theta_s < \theta^*] \).
If the posterior is calibrated, this rank follows a discrete uniform distribution on {0, 1, …, S}.
0.2.2 Diagnostics
We use several complementary diagnostics to assess calibration. Each provides a different lens on the same underlying question—whether the posterior is well-calibrated—so that no single summary statistic drives our conclusions.
Rank histograms: A visual check for uniformity. Each bin of the histogram counts how many SBC simulations produced a thinned rank in a given range. If the posterior is calibrated, all bins should contain roughly the same number of counts (i.e., the histogram should be approximately flat). Systematic patterns indicate specific calibration failures: a U-shape arises when posteriors are too narrow (underdispersed), pushing true values into the extreme ranks, while an inverted-U arises when posteriors are too wide (overdispersed), clustering true values in the central ranks.
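These shapes can be reproduced synthetically. The sketch below uses a toy conjugate normal model (an illustrative assumption, not our actual models) and deliberately rescales the posterior standard deviation by a factor `c`, then measures how much rank mass falls in the central bins:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy conjugate normal model: prior mu ~ N(0, 1), data y ~ N(mu, sigma^2).
# Rescaling the exact posterior sd by `c` mimics miscalibration:
# c < 1 under-disperses the posterior, c > 1 over-disperses it.
def sbc_ranks(c, n_sims=2000, n_draws=99, sigma=2.0, n_obs=10):
    ranks = np.empty(n_sims, dtype=int)
    for i in range(n_sims):
        mu_star = rng.normal()
        y = rng.normal(mu_star, sigma, size=n_obs)
        prec = 1.0 + n_obs / sigma**2
        m, sd = (y.sum() / sigma**2) / prec, 1.0 / np.sqrt(prec)
        draws = rng.normal(m, c * sd, size=n_draws)   # c != 1 miscalibrates
        ranks[i] = np.sum(draws < mu_star)
    return ranks

def mid_mass(r):
    # Fraction of ranks landing in the middle fifth of {0, ..., 99}.
    return np.mean((r >= 40) & (r < 60))

for c, label in [(1.0, "calibrated -> flat histogram"),
                 (0.5, "too narrow -> U-shape"),
                 (2.0, "too wide   -> central peak")]:
    print(f"c = {c}: middle-fifth mass = {mid_mass(sbc_ranks(c)):.2f}  ({label})")
```

A calibrated posterior puts about 20% of the ranks in the middle fifth; the under-dispersed posterior puts noticeably less there (mass shifts to the tails), the over-dispersed posterior noticeably more.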
ECDF plots: The empirical cumulative distribution function of the normalized ranks is plotted against the CDF of a Uniform(0, 1) distribution (the diagonal line). Departures from the diagonal reveal the same calibration issues as the rank histogram, but can be easier to read when the number of SBC simulations is small.
Chi-square goodness-of-fit tests: A formal test of whether the observed bin counts in the rank histogram are consistent with a uniform distribution. The test statistic \(\chi^2 = \sum_b (O_b - E_b)^2 / E_b\) is compared to a \(\chi^2\) distribution with \(B - 1\) degrees of freedom (where \(B\) is the number of bins). A small \(p\)-value (e.g., \(p < 0.05\)) suggests the ranks are not uniform. With 999 SBC simulations and 20 bins, the expected count per bin is approximately 50, which is well within the range where the chi-square approximation is reliable.
Kolmogorov-Smirnov (KS) tests: A non-parametric test comparing the rank ECDF to the Uniform(0, 1) CDF. The KS statistic equals the maximum vertical distance between the two curves. Like the chi-square test, a small \(p\)-value suggests non-uniformity. The KS test does not require binning, but can behave conservatively with discrete rank data.
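Both formal tests can be applied to a vector of ranks in a few lines. The sketch below uses synthetic uniform ranks in place of actual SBC output:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

L = 999                                   # post-thinned draws per simulation
ranks = rng.integers(0, L + 1, size=999)  # stand-in for SBC ranks

# Chi-square: bin the ranks and compare observed to expected counts.
n_bins = 20
counts, _ = np.histogram(ranks, bins=n_bins, range=(0, L + 1))
chi2, chi2_p = stats.chisquare(counts)    # expected counts uniform by default

# KS: compare normalized ranks to the Uniform(0, 1) CDF.
ks_stat, ks_p = stats.kstest(ranks / L, "uniform")

print(f"chi-square: stat = {chi2:.2f}, p = {chi2_p:.3f}")
print(f"KS:         stat = {ks_stat:.3f}, p = {ks_p:.3f}")
```

Because the ranks here are drawn uniformly, both tests should (with high probability) fail to reject; on real SBC output, small p-values flag calibration problems.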
0.2.3 Study Design
We use matched study designs for m_0 and m_1 to enable fair comparison:
```python
# Study design configurations
config_m0 = {
    "M": 25,  # Number of uncertain decision problems
    "K": 3,   # Number of consequences
    "D": 5,   # Feature dimensions
    "R": 15,  # Distinct alternatives
    "min_alts_per_problem": 2,
    "max_alts_per_problem": 5,
    "feature_dist": "normal",
    "feature_params": {"loc": 0, "scale": 1},
}

config_m1 = {
    **config_m0,  # Same uncertain problem structure
    "N": 25,      # Risky problems (matching M)
    "S": 15,      # Risky alternatives (matching R)
}

print("Study Design Comparison:")
print(f"\nm_0 (Uncertain Only):")
print(f"  M = {config_m0['M']} decision problems")
print(f"  Total choices: ~{config_m0['M'] * 3.5:.0f}")
print(f"\nm_1 (Uncertain + Risky):")
print(f"  M = {config_m1['M']} uncertain + N = {config_m1['N']} risky")
print(f"  Total choices: ~{(config_m1['M'] + config_m1['N']) * 3.5:.0f}")
```
Study Design Comparison:
m_0 (Uncertain Only):
M = 25 decision problems
Total choices: ~88
m_1 (Uncertain + Risky):
M = 25 uncertain + N = 25 risky
Total choices: ~175
```python
# SBC configuration following Stan User's Guide §10.1 recommendations.
# With L post-thinned draws per simulation, the rank statistic takes values
# in {0, 1, ..., L}, giving L + 1 possible values. Setting L = N_sbc = 999
# means the expected rank histogram is discrete-uniform on 1000 values;
# with 20 bins we get ~50 counts per bin, well within the chi-square
# approximation's comfort zone, and the KS confidence band shrinks to
# eps ~ 0.043, roughly 3x narrower than the previous n = 100 design.
n_sbc_sims = 999      # Number of SBC iterations (N)
n_mcmc_chains = 1     # Single chain (standard for SBC)
thin = 4              # Thinning factor
n_mcmc_samples = thin * n_sbc_sims  # 3996 total draws → 999 post-thinned
max_rank = n_mcmc_samples // thin   # 999 (= L)
n_bins = 20           # Histogram bins
expected_per_bin = n_sbc_sims / n_bins  # ~49.95

print("SBC Configuration:")
print(f"  Simulations (N): {n_sbc_sims}")
print(f"  MCMC samples per sim: {n_mcmc_samples}")
print(f"  Chains: {n_mcmc_chains}")
print(f"  Thinning factor: {thin}")
print(f"  Post-thinned draws (L): {max_rank}")
print(f"  Possible rank values: {max_rank + 1}")
print(f"  Histogram bins: {n_bins}")
print(f"  Expected counts/bin: {expected_per_bin:.1f}")
```
SBC Configuration:
Simulations (N): 999
MCMC samples per sim: 3996
Chains: 1
Thinning factor: 4
Post-thinned draws (L): 999
Possible rank values: 1000
Histogram bins: 20
Expected counts/bin: 50.0
0.3 Running SBC for m_0
First, we run SBC for model m_0 (uncertain choices only) to establish a baseline:
Figure 1: SBC rank distributions for α. With 999 SBC simulations the chi-square test is well-powered. See the printed test statistics below for the exact values in this run.
α Uniformity Tests:
m_0: χ² = 17.76, p = 0.539
m_1: χ² = 22.96, p = 0.239
Note: Interpreting the α results
Both models are expected to identify α reasonably well, since sensitivity influences all choice probabilities. With 999 SBC simulations the chi-square test has adequate power to detect moderate departures from uniformity, so the p-values should be taken at face value: a rejection at the 0.05 level is meaningful evidence of miscalibration. If m_0 shows a notably lower p-value for α than m_1, that could indicate a mild calibration issue stemming from the β–δ identification problem propagating to α estimates; the parameter recovery results in Report 4 are relevant context here.
We now turn to the δ parameters, the critical comparison. In m_0, the β–δ interaction makes δ difficult to identify from uncertain choices alone (Report 4); in m_1, risky choices supply direct information about utilities that should improve δ calibration:
Figure 2: SBC rank distributions for δ parameters. Compare the shape of the rank histograms across models: departures from uniformity indicate poor calibration. See the formal test results below for quantitative assessment.
δ Parameter Uniformity Tests:
--------------------------------------------------
δ_1:
m_0: χ² = 14.79, p = 0.736 | KS = 0.022, p = 0.709
m_1: χ² = 24.08, p = 0.193 | KS = 0.024, p = 0.603
δ_2:
m_0: χ² = 14.79, p = 0.736 | KS = 0.022, p = 0.709
m_1: χ² = 24.08, p = 0.193 | KS = 0.024, p = 0.603
Note: Interpreting the δ results
Calibration and identification are related but distinct concepts. Calibration refers to whether the posterior accurately represents uncertainty (uniform SBC ranks). Identification refers to whether the data are informative enough to pin down the parameter. Poor identification leads to poor calibration in SBC, because when the data are uninformative the posterior fails to update appropriately from the prior, which manifests as non-uniform ranks (often a central peak, indicating the posterior is too diffuse).
When reading the test statistics, note the following caveats:
Chi-square p-values depend on the particular binning. With 999 simulations and 20 bins, expected counts per bin are approximately 50, so the chi-square approximation is reliable and the test has good power to detect moderate departures from uniformity.
KS test values should in principle differ between models. If they appear identical or very similar, check whether the rank distributions actually differ visually in the histograms and ECDF plots. The KS test can behave conservatively with discrete rank data (since it was designed for continuous distributions), which may reduce its sensitivity to real differences between models.
The rank histograms and ECDF plots are the primary diagnostics; the formal tests supplement but do not replace visual inspection.
0.5.2 ECDF Comparison
The Empirical Cumulative Distribution Function (ECDF) provides another view of calibration. For well-calibrated parameters, the ECDF should follow the diagonal:
Figure 3: ECDF plots for δ parameters. The closer the curve to the diagonal, the better the calibration. The shaded band shows the 95% Kolmogorov-Smirnov confidence region for n=999 simulations. Compare the m_0 and m_1 curves to assess whether adding risky choices improves δ calibration.
The 95% confidence band width is \(\epsilon \approx 0.043\) for \(n=999\) simulations, roughly three times narrower than the \(\epsilon \approx 0.136\) that would obtain with only 100 iterations. This gives us sensitivity to detect relatively small departures from uniformity.
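These band widths follow the asymptotic KS formula \(\epsilon \approx 1.36/\sqrt{n}\) at the 95% level (1.36 is the asymptotic critical value for \(\alpha = 0.05\)):

```python
import math

# 95% KS confidence band half-width: eps = 1.36 / sqrt(n).
for n in (100, 999):
    eps = 1.36 / math.sqrt(n)
    print(f"n = {n:4d}: eps ~ {eps:.3f}")
```

This recovers \(\epsilon \approx 0.136\) for \(n = 100\) and \(\epsilon \approx 0.043\) for \(n = 999\), matching the bands drawn in Figure 3.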
0.5.3 β (Feature Weight Parameters)
Finally, we examine β calibration. Because β and δ interact multiplicatively in the expected utility (see the identification analysis in Report 5), poor identification of δ in m_0 can propagate to β, potentially degrading its calibration as well. In m_1, if risky choices successfully pin down δ, the β parameters may also benefit from improved calibration.
Figure 4: Summary of β parameter SBC calibration. Because β and δ interact in the expected utility, the β–δ identification problem in m_0 may affect β calibration as well. Compare p-values across models to assess whether m_1’s improved δ identification also benefits β.
Table 1: SBC calibration comparison between m_0 and m_1. Higher p-values suggest better calibration (uniformity of rank distribution). With 999 simulations and ~50 expected counts per bin, the chi-square tests are well-powered.
The SBC analysis compares calibration between m_0 and m_1. The key question is whether adding risky choices improves calibration of the δ parameters, which were poorly identified in m_0’s parameter recovery analysis (Report 4).
Tip: Key Finding: SBC Provides Evidence of Improved δ Calibration
Model m_0 (uncertain choices only):
α: Generally well-calibrated, though some runs may show borderline p-values (see the α discussion above)
β: Calibration may be affected by the β–δ identification interaction; check results against the parameter recovery findings in Report 4
δ: Expected to show poor calibration (non-uniform ranks reflecting identification problems)
Model m_1 (uncertain + risky choices):
α: Generally well-calibrated
β: Expected to benefit from improved δ identification
δ: Expected to show improved calibration (more uniform ranks)
The SBC results provide further evidence that adding risky choices improves δ calibration, consistent with the theoretical identification analysis in Report 5. With 999 SBC simulations, 999 post-thinned posterior draws per simulation, and ~50 expected counts per histogram bin, the formal test statistics are well-powered and the chi-square approximation is reliable.
Note: Distinguishing Identification from Inference Problems
SBC can detect both inference failures (bugs in the sampler) and identification problems (structural non-identifiability). We can distinguish them by checking MCMC diagnostics: if \(\hat{R} \approx 1\) and ESS is adequate but ranks are non-uniform, the issue is identification rather than inference. In our analysis, both models show good MCMC diagnostics, suggesting that non-uniform δ ranks in m_0 (if observed) reflect identification limitations rather than computational issues.
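A minimal split-\(\hat{R}\) check can be sketched as follows. This is a simplified single-chain version of the diagnostic (split the chain in half and treat the halves as separate chains); in practice one would use a library implementation such as ArviZ:

```python
import numpy as np

def split_rhat(chain):
    """Split-R-hat for a single chain: compare its two halves as if they
    were separate chains (a simplified sketch of the diagnostic)."""
    half = len(chain) // 2
    chains = np.stack([chain[:half], chain[half:2 * half]])  # shape (2, half)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_plus = (n - 1) / n * W + B / n       # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(4)
good = rng.normal(size=4000)                       # well-mixed draws
stuck = np.concatenate([rng.normal(0, 1, 2000),    # chain whose second half
                        rng.normal(3, 1, 2000)])   # drifted to a new region
print(f"R-hat (well-mixed): {split_rhat(good):.3f}")
print(f"R-hat (drifting):   {split_rhat(stuck):.3f}")
```

The well-mixed chain yields \(\hat{R} \approx 1\), while the drifting chain yields \(\hat{R}\) well above the usual 1.01 threshold; pairing this check with ESS lets us attribute non-uniform ranks to identification rather than sampling failure.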
When a parameter is poorly identified, the posterior fails to concentrate around the true value. This manifests in the SBC rank distribution as follows:
The posterior remains nearly as wide as the prior (weak updating from the data)
True values tend to have central ranks (ranks cluster around the median)
The rank histogram shows a characteristic peak in the middle
This pattern—if observed for δ in m_0—would be consistent with the model’s inability to distinguish between different utility functions on the basis of uncertain choices alone.
0.7.3 Why m_1 Is Expected to Improve Calibration
In m_1, risky choices provide direct information about utilities without confounding with subjective probabilities:
With δ better identified, uncertain choices more effectively constrain β
Both choice types constrain α
0.8 Conclusion
Simulation-Based Calibration provides a principled framework for assessing posterior calibration, complementing the parameter recovery analysis of Report 4:
m_0 faces a calibration challenge for δ: The β–δ identification problem documented in Reports 4 and 5 is expected to manifest as non-uniform SBC ranks for δ, and potentially for β as well
m_1 is expected to improve δ calibration: Adding risky choices provides a direct route to utility identification, which should yield more uniform SBC ranks for δ
The improvement is structural: The change is not simply about adding more data, but about the type of data—risky choices provide qualitatively different information than uncertain choices, as formalized in the Anscombe-Aumann framework (Report 5)
The degree of improvement observed in any particular run depends on the random seed and the study design. With 999 SBC simulations the formal tests are well-powered, so readers can rely on both the rank histograms / ECDF plots and the chi-square and KS statistics when drawing conclusions.
NoteOn Statistical Power
With 999 SBC simulations and 20 histogram bins, the expected count per bin is approximately 50, well within the range where the chi-square approximation is reliable. The KS test can still behave conservatively with discrete rank data (since it was designed for continuous distributions), so visual inspection of rank histograms and ECDF plots remains a useful complement to the formal tests.